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Abstract 

Demixing problems in many areas such as hyperspectral imaging and differential opti- 
cal absorption spectroscopy (DOAS) often require finding sparse nonnegative linear com- 
binations of dictionary elements that match observed data. We show how aspects of 
these problems, such as misalignment of DOAS references and uncertainty in hyperspec- 
tral endmembers, can be modeled by expanding the dictionary with grouped elements 
and imposing a structured sparsity assumption that the combinations within each group 
should be sparse or even 1-sparse. If the dictionary is highly coherent, it is difficult to 
obtain good solutions using convex or greedy methods, such as non-negative least squares 
(NNLS) or orthogonal matching pursuit. We use penalties related to the Hoyer measure, 
which is the ratio of the l\ and I2 norms, as sparsity penalties to be added to the ob- 
jective in NNLS-type models. For solving the resulting nonconvex models, we propose a 
scaled gradient projection algorithm that requires solving a sequence of strongly convex 
quadratic programs. We discuss its close connections to convex splitting methods and dif- 
ference of convex programming. We also present promising numerical results for example 
DOAS analysis and hyperspectral demixing problems. 
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1 Introduction 



A general demixing problem is to estimate the quantities or concentrations of the individual 
components of some observed mixture. Often a linear mixture model is assumed [lj. In 
this case the observed mixture b is modeled as a linear combination of references for each 
component known to possibly be in the mixture. If we put these references in the columns of 
a dictionary matrix A, then the mixing model is simply Ax = b. Physical constraints often 
mean that x should be nonnegative, and depending on the application we may also be able 
to make sparsity assumptions about the unknown coefficients x. This can be posed as a basis 
pursuit problem where we are interested in finding a sparse and perhaps also non-negative 
linear combination of dictionary elements that match observed data. This is a very well 
studied problem. Some standard convex models are non-negative least squares (NNLS) [2, 3J, 



and methods based on l\ minimization [IJ[5j[6]. There are also variations that enforce different 
group sparsity assumptions on x [7J El [9] . 

In this paper we are interested in how to deal with uncertainty in the dictionary. The 
case when the dictionary is unknown is dealt with in sparse coding and non-negative ma- 
trix factorization (NMF) problems |10|. [TTj [T2| [TB"! [Ml 1 15 j . which require learning both the 
dictionary and a sparse representation of the data. We are, however, interested in the case 
where we know the dictionary but are uncertain about each element. One example we will 
study in this paper is differential optical absorption spectroscopy (DOAS) analysis [16], for 
which we know the reference spectra but are uncertain about how to align them with the 
data because of wavelength misalignment. Another example we will consider is hyperspectral 
unmixing |17 1 118 1 [19]. Multiple reference spectral signatures, or endmembers, may have been 
measured for the same material, and they may all be slightly different if they were measured 
under different conditions. We may not know ahead of time which one to choose that is most 
consistent with the measured data. Although there is previous work that considers noise in 
the endmembers [20] and represents endmembers as random vectors {21] , we may not always 
have a good general model for endmember variability. For the DOAS example, we do have 
a good model for the unknown misalignment [16], but even so, incorporating it may signifi- 
cantly complicate the overall model. Therefore for both examples, instead of attempting to 
model the uncertainty, we propose to expand the dictionary to include a representative group 
of possible elements for each uncertain element as was done in |22j . 

The grouped structure of the expanded dictionary is known by construction, and this 
allows us to make additional structured sparsity assumptions about the corresponding co- 
efficients. In particular, the coefficients should be extremely sparse within each group of 
representative elements, and in many cases we would like them to be at most 1-sparse. We 
will refer to this as intra group sparsity. If we expected sparsity of the coefficients for the 
unexpanded dictionary, then this will carry over to an inter group sparsity assumption about 
the coefficients for the expanded dictionary. By inter group sparsity we mean that with the 
coefficients split into groups, the number of groups containing nonzero elements should also be 
sparse. Modeling structured sparsity by applying sparsity penalties separately to overlapping 
subsets of the variables has been considered in a much more general setting in [81 [23] . 

The expanded dictionary is usually an underdetermined matrix with the property that it is 
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highly coherent because the added columns tend to be similar to each other. This makes it very 
challenging to find good sparse representations of the data using standard convex minimization 
and greedy optimization methods. If A satisfies certain properties related to its columns not 
being too coherent [24J, then sufficiently sparse non-negative solutions are unique and can 
therefore be found by solving the convex NNLS problem. These assumptions are usually not 
satisfied for our expanded dictionaries, and while NNLS may still be useful as an initialization, 
it does not by itself produce sufficiently sparse solutions. Similarly, our expanded dictionaries 
usually do not satisfy the incoherence assumptions required for l\ minimization or greedy 
methods like Orthogonal Matching Pursuit (OMP) to recover the Iq sparse solution |25} 126] . 
However, with an unexpanded dictionary having relatively few columns, these techniques can 
be effectively used for sparse hyperspectral unmixing [27] . 

The coherence of our expanded dictionary means we need to use different tools to find good 
solutions that satisfy our sparsity assumptions. We would like to use a variational approach as 
similar as possible to the NNLS model that enforces the additional sparsity while still allowing 
all the groups to collaborate. We propose adding nonconvex sparsity penalties to the NNLS 
objective function (fTJ). We can apply these penalties separately to each group of coefficients 
to enforce intra group sparsity, and we can simultaneously apply them to the vector of all 
coefficients to enforce additional inter group sparsity. From a modeling perspective, the ideal 
sparsity penalty is Iq. There is a very interesting recent work that directly deals with Iq 
constraints and penalties via a quadratic penalty approach [28] ■ If the variational model is 
going to be nonconvex, we prefer to work with a differentiable objective when possible. We 
therefore explore the effectiveness of sparsity penalties based on the Hoyer measure [291 ED] , 
which is essentially the ratio of l\ and I2 norms. In previous works, this has been successfully 
used to model sparsity in NMF and blind deconvolution applications [29], |31] [32] . We also 
consider the difference of l\ and I2 norms. By the relationship, ||x||i— ||x||2 = IM^jJfjj^ - 1)> we 
see that while the ratio of norms is constant in radial directions, the difference increases moving 
away from the origin except along the axes. Since the Hoyer measure is twice differentiable 
on the non-negative orthant away from the origin, it can be locally expressed as a difference 
of convex functions, and convex splitting or difference of convex (DC) methods [33] can be 
used to find a local minimum of the nonconvex problem. Some care must be taken, however, 
to deal with its poor behavior near the origin. It is even easier to apply DC methods when 
using l\ - I2 as a penalty, since this is already a difference of convex functions, and it is well 
defined at the origin. 

The paper is organized as follows. In Section [2] we define the general model, describe 
the dictionary structure and show how to use both the ratio and the difference of l\ and I2 
norms to model our intra and inter group sparsity assumptions. Section [3] derives a method for 
solving the general model, discusses connections to existing methods and includes convergence 
analysis. In Section 0] we discuss specific problem formulations for several examples related to 
DOAS analysis and hyperspectral demixing. Numerical experiments for comparing methods 
and applications to example problems are presented in Section 

2 Problem 

For the non- negative linear mixing model Ax = b, let b £ M. w , A E rWxN anc j x g 
with x > 0. Let the dictionary A have I2 normalized columns and consist of M groups, each 
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Figure 1: l\ and I2 unit balls 



with nij elements. We can write A 



Am] and x = \x\ ■ ■ ■ xm\ , where each 



xj S M m J and N = ^ 7 =i m j- The general non- negative least squares problem with sparsity 



constraints that we will consider is 



mmF(x) := -II Ac - 6|| 2 + R(x) 
x>o 2 



where 



R(x) 



M 

£ 



(3) 



The functions Rj represent the intra group sparsity penalties applied to each group of co- 
efficients Xj, j = 1, ...,M, and Rq is the inter group sparsity penalty applied to x. If F is 
differentiable, then a necessary condition for x* to be a local minimum is given by 



(y 



VF(x*) > Vy > . 



(4) 



For the applications we will consider, we want to constrain each vector Xj to be at most 
1-sparse, which is to say that we want ||xj||o < 1- To accomplish this through the model ([2]), 
we will need to choose the parameters jj to be sufficiently large. 

The sparsity penalties Rj and Rq will either be the ratios of l\ and I2 norms defined by 



H j (x j )=j j 



NI2 



or they will be the differences defined by 
Sj{xj) = ij{\\xj\\i - \\xj\\ 2 ) 



and 



and 



A geometric intuition for why minimizing promotes sparsity of x is that since it is 
constant in radial directions, minimizing it tries to reduce ||x||i without changing [|ac||2- As 
seen in Figure [TJ sparser vectors have smaller l\ norm on the l 2 sphere. 

Neither Hj or Sj is differentiable at zero, and Hj is not even continuous there. Figure [2] 
shows a visualization of both penalties in two dimensions. To obtain a differentiable F, we 



Hq(x) = 7o- 



Fill 
MI2 



So{x) = Jo(\\x\\i - \\x\\ 2 ) 



(5) 



(6) 
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Figure 2: Visualization of I1/I2 and l\ - I2 penalties 



can smooth the sparsity penalties by replacing the I2 norm with the Huber function, defined 
by the infimal convolution 



(x,e) = inf ||y||2 + —\\y - x\ 
y 2e 




if 1Mb < e 
otherwise . 



(7) 



In this way we can define differentiable versions of sparsity penalties H and S by 



{Xj , €j ) + g 

Hi 

(^e ) + f 



(8) 



s o( x ) = Todklli - ^{x,e Q )) 



(9) 



These smoothed sparsity penalties are shown in Figure [3j The regularized penalties behave 
more like l\ near the origin and should tend to shrink Xj that have small I2 norms. 

An alternate strategy for obtaining a differentiable objective that doesn't require smooth- 
ing the sparsity penalties is to add M additional dummy variables and modify the convex 
constraint set. Let d E M A/ , d > denote a vector of dummy variables. Consider applying 



Rj to vectors 



instead of to Xj. Then if we add the constraints + dj > ej, we are 



assured that Rj(xj,dj) will only be applied to nonzero vectors, even though Xj is still allowed 
to be zero. Moreover, by requiring that ^ • < M — r, we can ensure that at least r of the 
vectors Xj have one or more nonzero elements. In particular, this prevents x from being zero, 
so Ro(x) is well defined as well. 
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Figure 3: Visualization of regularized I1/I2 and l\ - I2 penalties 

The dummy variable strategy is our preferred approach for using the penalty. The 
high variability of the regularized version near the origin creates numerical difficulties. It 
either needs a lot of smoothing, which makes it behave too much like l±, or its steepness near 
the origin makes it harder numerically to avoid getting stuck in bad local minima. For the l\ 
- I2 penalty, the regularized approach is our preferred strategy because it is simpler and not 
much regularization is required. Smoothing also makes this penalty behave more like l\ near 
the origin, but a small shrinkage effect there may in fact be useful, especially for promoting 
inter group sparsity. These two main problem formulations are summarized below as Problem 
1 and Problem 2 respectively. 
Problem 1: 

1 M 

mm F H (x,d) := -\\Ax - b\\ 2 + } -/ j H j (x j , dj) + joH (x) 

x,d z * — ' 

j=l 

M d 

such that x > 0, d > 0, — < M — r and [|scj||i + dj > tj, j = 1, M . 

j 1 '•' 

Problem 2: 

1 M 
mmF s (x) := -\\Ax - bf +Y J ^ S K x i) + ^{x) . 

3 Algorithm 

Both Problems 1 and 2 from Section [2] can be written abstractly as 

mmF(x) :=-\\Ax -b\\ 2 + R(x), (10) 

x£X 2 
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where X is a convex set. Problem 2 is already of this form with X = {x £ 1^ : x > 0}. 
Problem 1 is also of this form, with X = {x £ R N ,d £ R A/ : x > 0, d > 0, + dj > 

Ej, < M — r}. Note that the objective function of Problem 1 can also be written as in 

(jlOP if we redefine Xj as ^ and consider an expanded vector of coefficients x 6 M Ar+M that 

includes the M dummy variables, d. The data fidelity term can still be written as ^||Ax — b\\ 2 
if columns of zeros are inserted into A at the indices corresponding to the dummy variables. 
In this section, we will describe algorithms and convergence analysis for solving (llOj) under 
either of two sets of assumptions. 

Assumption 1. 

• X is a convex set. 

• R(x) £ C 2 (X, R) and the eigenvalues of V 2 R(x) are bounded on X. 

• F is coercive on X in the sense that for any x° £ X, {x £ X : F(x) < F(x )} is a 
bounded set. In particular, F is bounded below. 

Assumption 2. 

• R(x) is concave and differentiable on X. 

• Same assumptions on X and F as in Assumption Q] 

Problem 1 satisfies Assumption 1 and Problem 2 satisfies Assumption 2. We will first 
consider the case of Assumption 1. 

Our approach for solving (jlOf) was originally motivated by a convex splitting technique 



from |344 [35] that is a semi-implicit method for solving = — Vi ? (x), x(0) = x° when i 7 
can be split into a sum of convex and concave functions F c (x) + F E (x), both in C^M^M). 
Let A^ ax be an upper bound on the eigenvalues of V 2 F E , and let A m i n be a lower bound on 
the eigenvalues of V 2 F. Under the assumption that A^ ax < ^A m i n it can be shown that the 
update defined by 

x n+i = x n + At(-VF c {x n+1 ) - VF E (x n )) (11) 

doesn't increase F for any time step At > 0. This can be seen by using second order Taylor 
expansions to derive the estimate 

F(x n+1 ) - F(x n ) < (Al x - \\C. a - - *1 2 . (12) 

This convex splitting approach has been shown to be an efficient method much faster than 
gradient descent for solving phase-field models such as the Cahn-Hilliard equation, which has 
been used for example to simulate coarsening [35] and for image inpainting [36] . 

By the assumptions on R, we can achieve a convex concave splitting, F = F c + F E , by 
letting F (x) = %\\Ax — b\\ 2 + \\x\\q and F E (x) = R(x) — \\x\\q for an appropriately chosen 
positive definite matrix C. We can also use the fact that F c (x) is quadratic to improve upon 
the estimate in (I12h when bounding F(x n+1 ) — F{x n ) by a quadratic function of x n+1 . Then 
instead of choosing a time step and updating according to (fTT|) . we can dispense with the 
time step interpretation altogether and choose an update that reduces the upper bound on 
F(x n+l ) — F(x n ) as much as possible subject to the constraint. This requires minimizing a 
strongly convex quadratic function over X. 
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Proposition 3.1. Let Assumption^ hold. Also let X r and Xr be lower and upper bounds 
respectively on the eigenvalues ofV 2 R(x) for x £ X. Then for x,y £ X and for any matrix 
C, 

F(y)-F(x) < (y- x f((X R -^X r )l-C)(y-x) + (y-x) T ^A T A + C)(y-x) + (y-x) T VF(x) . 

(13) 

Proof. The estimate follows from combining several second order Taylor expansions of F and 
R with our assumptions. First expanding F about y and using h = y — x to simplify notation, 
we get that 

F(x) = F(y) - h T VF(y) + hi T V 2 F{y - atih)h 
for some a± G (0, 1). Substituting F as defined by (fTOj) . we obtain 

F(y) - F(x) = h T (A T Ay - A T b + VR(y)) - hi T A T Ah - hi T V 2 R{y - a x h)h (14) 

Similarly, we can compute Taylor expansions of R about both x and y. 

R(x) = R(y) - h T VR(y) + hi T V 2 R{y - a 2 h)h . 

R(y) = R(x) + h T X7R(x) + hi T V 2 R{x + a 3 h)h . 
Again, both a 2 and are in (0, 1). Adding these expressions implies that 

h T {VR(y) - VR(x)) = hi T V 2 R(y - a 2 h)h + hi T V 2 R(x + a 3 h)h . 

From the assumption that the eigenvalues of V 2 R are bounded above by A# on X, 

h T (VR(y)-VR(x))<X R \\h\\ 2 . (15) 
Adding and subtracting h T VR(x) and h T A T Ax to (|14p yields 

F(y) - F(x) = h T A T Ah + h T (A T Ax - A T b + VR(x)) + h T (VR(y) - VR(x)) 
- hi T A T Ah - hi T V 2 R{y - axh)h 

= hi T A T Ah + h T X7F(x) + h T (VR(y) - VR(x)) - hi T V 2 R(y - a x h)h . 

Using (USD, 

F(y) - F(x) < hi T A T Ah + h T VF{x) - hi T V 2 R{y - a x h)h + X R \\h\\ 2 . 

The assumption that the eigenvalues of V 2 R(x) are bounded below by A r on X means 

F(y) - F(x) < (X R - ^X r )\\h\\ 2 + l -h T A T Ah + h T VF(x) . 

Since the estimate is unchanged by adding and subtracting h T Ch for any matrix C, the 
inequality in (|13p follows directly. □ 
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Corollary. Let C be symmetric positive definite and let X c denote the smallest eigenvalue of 
C ■ If A c > \r — \\ r , then for x, y G X , 

F(y) - F(x) <(y- xf(^A T A + C)(y - x) + (y - x) T VF(x) . 

A natural strategy for solving (jlOp is then to iterate 

x n+1 = argmin(x - x n ) T {\a t A + C n ){x - x n ) + (x - x n ) T VF(x n ) (16) 

for C n chosen to guarantee a sufficient decrease in F. The method obtained by iterating (fT6|) 
can be viewed as an instance of scaled gradient projection [3T|, [38l [39] where the orthogonal 
projection of x n — (A T A + 2C n )~ l VF(x n ) onto X is computed in the norm || • |U T A+2C„- The 
approach of decreasing F by minimizing an upper bound coming from an estimate like (j!3p 
can be interpreted as an optimization transfer strategy of defining and minimizing a surrogate 
function |40j , which is done for related applications in [12} [T3] . It can also be interpreted as 
a special case of difference of convex programming [33 j . 

Choosing C n in such a way that guarantees (x n+1 — x u ) t ((\r — ^A r )I — C n )(x n+1 — x n ) < 
may be numerically inefficient, and it also isn't strictly necessary for the algorithm to converge. 
To simplify the description of the algorithm, suppose C n = c n C for some scalar c n > and 
symmetric positive definite C. Then as c n gets larger, the method becomes more like explicit 
gradient projection with small time steps. This can be slow to converge as well as more prone 
to converging to bad local minima. However, the method still converges as long as each c n is 
chosen so that the x n+1 update decreases F sufficiently. Therefore we want to dynamically 
choose c n > to be as small as possible such that the x n+1 update given by (fT6|) decreases F 
by a sufficient amount, namely 

F(x n+1 ) - F(x n ) < a 

for some a 6 (0,1]. Additionally, we want to ensure that the modulus of strong convexity 
of the quadratic objective in (fl6|) is large enough by requiring the smallest eigenvalue of 
^A T A + C n to be greater than or equal to some p > 0. The following is an algorithm for 
solving (|!Up and a dymamic update scheme for C n = c n C that is similar to Armijo line search 
but designed to reduce the number of times that the solution to the quadratic problem has 
to be rejected for not decreasing F sufficiently. 



Algorithm 1: A Scaled Gradient Projection Method for Solving (jlOp Under Assumption Q] 

Define x° G X, c > 0, a G (0, 1], e > 0, p > 0, £i > 1,£ 2 > 1 and set n = 0. 

while n = or ||x n — i"" 1 ^ > e 

y = arg min (x - x n ) T {]-A T A + c n C)(x - x n ) + (x - x n ) T VF{x n ) 
igx 2 

if F(y) - F(x n ) >a[(y- x n ) T (\A T A + c n C)(y - x n ) + (y - x n ) T VF{x n )] 



{x n+l - x n ) T {^A T A + C n ){x n+1 - x n ) + (x n+1 - x n ) T VF(x n ) 
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else 



if smallest eigenvalue of j^C + \A T A is greater than p 



Cn+1 = I & — — ^ 6 ^,^^ ^ ^ 
1 c„ otherwise . 



n = n + 1 
end if 
end while 



It isn't necessary to impose an upper bound on c n in Algorithm [T] even though we want 
it to be bounded. The reason for this is because once c n > \r — \\ r , F will be sufficiently 
decreased for any choice of a G (0, 1], so c n is effectively bounded by ^{^R — |^r)- 

Under Assumption [2] it is much more straightforward to derive an estimate analogous to 
Proposition 13.11 Concavity of R(x) immediately implies 

R(y) < R(x) + {y- x) T VR(x) . 

Adding to this the expression 

±\\Ay - b\\ 2 = ±\\Ax - b\\ 2 + (y- xf(A T Ax - A T b) + - x) T A T A(y - x) 

yields 

F(y) - F{x) <(y- x) T ^A T A(y - x) + (y - x) T VF(x) (17) 

for x, y £ X. Moreover, the estimate still holds if we add (y — x) T C(y — x) to the right hand 
side for any positive semi-definite matrix C. We are again led to iterating (|16p to decrease 
F, and in this case C n need only be included to ensure that A T A + 2C n is positive definite. 
We can let C n = C since the dependence on n is no longer necessary. We can choose any C 
such that the smallest eigenvalue of C + \A T A is greater than p > 0, but it is still preferable 
to choose C as small as is numerically practical. 



Algorithm 2: A Scaled Gradient Projection Method for Solving (|10p Under Assumption [2] 
Define x° G X, C symmetric positive definite and e > 0. 
while n = or \\x n — i™ -1 ^ > e 

x n+l = argmin (x - x n ) T {-A T A + C){x - x n ) + (x - x n ) T VF(x n ) (18) 

x£X 2 

n = n + 1 
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end while 



Since the objective in (|T8[) is zero at x = x n , the minimum value is less than or equal to zero, 
and so F{x n+1 ) < F{x n ) by (fTTjl. 

Algorithm [2] is also equivalent to iterating 

x n+1 = argmin -\\Ax - b\\ 2 + \\x\\ 2 c + x T (VR{x n ) - 2Cx n ) , 

which can be seen as an application of the simplified difference of convex algorithm from [33] 
to F(x) = {t;\\Ax — b\\ 2 + WxWfj) — (-R(x) + ||x||^). The DC method in [33J is more general 
and doesn't require the convex and concave functions to be differentiable. 

With many connections to classical algorithms, existing convergence results can be applied 
to argue that limit points of the iterates {x n } of Algorithms [2] and [T] are stationary points of 
(fTUj) . We still choose to include a convergence analysis for clarity because our assumptions 
allow us to give a simple and intuitive argument. The following analysis is for Algorithm [T] 
under Assumption [TJ However, if we replace C n with C and a with 1, then it applies equally 
well to Algorithm [2] under Assumption [2j We proceed by showing that the sequence {x n } is 
bounded, ||x n+1 — x n \\ — > and limit points of {x 11 } are stationary points of (jlOp satisfying 
the necessary local optimality condition Q. 

Lemma 3.2. The sequence of iterates {x n } generated by Algorithmic is bounded. 

Proof. Since F(x n ) is non-increasing, x n £ {x G X : Fix) < ^(2;°)}, which is a bounded set 
by assumption. □ 

Lemma 3.3. Let {x n } be the sequence of iterates generated by Algorithm\7\ Then ||x n+1 — 
x n \\ -> 0. 

Proof. Since {F(x n )} is bounded below and non-increasing, it converges. By construction, 
x n+l satisfies 



(x n+1 - x n f{^A T A + C n )(x n+1 - x n ) + (x n+1 - x n ) T VF{x Tl 



< -{F{x n ) - F{x n+1 )) . 
a 



By the optimality condition for (]16l) . 

{y - x n+1 f {{A T A + 2C n ){x n+l - x n ) + VF(/)) > Vy G X . 
In particular, we can take y = x n , which implies 

{x n+1 - x n ) T {A T A + 2C n ){x n+1 - x n ) < -{x n+1 - x n ) T VF(x n ) . 

Thus 



n+l ^n\T 



1 .T . „ , / «-L1 »,% ,1 



[x — X 



{-A T A + C n ){x n+l - x n ) < -{F{x n ) - F(x n+1 )) . 



2 ' ~ a 

2' 



Since the eigenvalues of \A T A + C n are bounded below by p > 0, we have that 



/''II-''' _ 



n+l _ „n N 2 < -(F{x n ) - F{x n+l )) . 



a 
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The result follows from noting that 

lim \\x n+l - x n f < lim — {F{x n ) - F(x n+1 )), 

n— >oo n— >oo up 

which equals since {F(x n )} converges. □ 

Proposition 3.4. Any limit point x* of the sequence of iterates {x n } generated by Algorithm 
U\ satisfies (y — x*) T \7F(x*) > for all which means x* is a stationary point of ( tiOj) . 

Proof. Let x* be a limit point of {x n }. Since {x n } is bounded, such a point exists. Let {x Tlk } 
be a subsequence that converges to x*. Since ||x n+1 — x n \\ — > 0, we also have that x nk+1 — > x*. 
Recalling the optimality condition for (I16D . 

< (y - x nk+l ) T ((A T A + 2C nk ){x nk+l - x nk ) + VF(x nk )) < 
\\y - x nk+1 1| \\A T A + 2C nk || ||x nfc+1 - x nk \\ + {y- x nk+1 ) T V F{x nk ) VyGl. 

Following [37], proceed by taking the limit along the subsequence as — >• oo. 

\\y-x nk+1 \\\\x nk+1 -x nk \\\\A T A + 2C nk \\ -»• 

since ||x nfc+1 — x nfe || — > and ||^4 T j4 + 2C nfc || is bounded. By continuity of VF we get that 

(y-x*) T VF{x*) > VyGX. 

□ 

Each iteration requires minimizing a strongly convex quadratic function over the set X as 
defined in (|16|) . Many methods can be used to solve this, and we want to choose one that is 
as robust as possible to poor conditioning of \A T A + C n . For example, gradient projection 
works theoretically and even converges at a linear rate, but it can still be impractically slow. A 
better choice here is to use the alternating direction method of multipliers (ADMM) [41, 42J, 
which alternately solves a linear system involving ^A T A + C n and projects onto the constraint 
set. Applied to Problem 2, this is essentially the same as the application of split Bregman 
[43j to solve a NNLS model for hyperspectral demixing in [44]. We consider separately the 
application of ADMM to Problems 1 and 2. The application to Problem 2 is simpler. 

For Problem 2, (1161) can be written as 



x n+1 = argmin(x - x n ) T {]-A T A + C n ){x - x n ) + (x - x n ) T VF s (x n ) . 
x>o 2 

To apply ADMM, we can first reformulate the problem as 
mmg >0 (v) + (u-x n ) T (lA T A + C n ){u-x n ) + {u-x n ) T VF s {x n ) such that u = v, (19) 

u,v ~ 2 

f v > 

where g is an indicator function for the constraint defined by g>o(v) = < 

I oc otherwise 

Introduce a Lagrange multiplier p and define a Lagrangian 
L(n, v, p) = g> (v) + (u- x n ) T {\A T A + C n )(u - x n ) + (u- x n ) T VF s (x n ) +p T (u- v) (20) 
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and augmented Lagrangian 

L 5 (u,v,p) = L(u,v,p) + -\\u - v\\ 2 , 

where 5 > 0. ADMM finds a saddle point 

L(u* ,v* ,p) < L(u* ,v* ,p*) < L(u,v,p*) Vu,v,p 

by alternately minimizing L$ with respect to u, minimizing with respect to v and updating 
the dual variable p. Having found a saddle point of L, (u*, v*) will be a solution to (fT9j) and 
we can take v* to be the solution to (|16p . The explicit ADMM iterations are described in the 
following algorithm. 

Algorithm 3: ADMM for solving convex subproblem for Problem 2 
Define 5 > 0, v and p° arbitrarily and let k = 0. 
while not converged 

u k+l = x n + ^ A T A + + gQ-l ^ y k _ x nj _ p k_ VFs ( x r 

^ +1 = n> (" fc+1 + y 
p k+l =p k + 6 ( u k+l _ v k+lj 

k = k + 1 

end while 

Here n>o denotes the orthogonal projection onto the non-negative orthant. For this 
application of ADMM to be practical, (A T A + 2C n + 51) ~ 1 should not be too expensive to 
apply, and 5 should be well chosen. 

Since (|16j) is a standard quadratic program, a huge variety of other methods could also be 
applied. Variants of Newton's method on a bound constrained KKT system might work well 
here, especially if we find we need to solve the subproblem to very high accuracy. 

For Problem 2, (]16p can be written as 

(x n+1 , d n+1 ) = argmin(x - x n ) T {\a t A + (%)(x - x n ) + {d- d?) T C*{d - d n )+ 

x,d 2 

(x - x n ) T V x F H {x n ,d n ) + (d- d n ) T V d F H (x n ,d n ) . 

spect to x and d i 

with C% a diagonal matrix. It is helpful to 



Here, V x and represent the gradients with respect to x and d respectively. The matrix 

\C X " 

C n is assumed to be of the form C n = n nd 
represent the constraints in terms of convex sets defined by 



{ 


Xj 




d L 



e R m i +1 : \\ Xj \\i + dj > ej, xj > 0, dj > j j = 1, M , 
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M 



x. 







de 



and indicator functions gx e . and <7x for these sets. 

Let u and w represent x and <i. Then by adding splitting variables v x = u and v d = w we 
can reformulate the problem as 

min Tg Xe .(v xj ,v dj ) + 5x „H + (u - x n ) T {\A T A + C£)(« - x n ) + (w - d n ) T C d (w - d n )+ 

U,W,V x ,V d 3 J J ' I 



x - x n ) T V x F H {x n , d n ) + (w- d n ) T V d F H {x n , d n ) s.t. 



v x = u, v d = w . 



Adding Lagrange multipliers p x and p d for the linear constraints, we can define the augmented 
Lagrangian 

L s (u,w,v x ,v d ,p x ,p d ) = ^g x {v xj ,v dj ) + g X/3 (w) + (u - x a ) T (]-A T A + C*)(u - x n )+ 



(w - d n Y C«{w - d n ) + {x- x n Y V x F H (x n , d n ) + (w- d n Y V d F H (x n , d n )+ 

pl(u - v x )+pT(w - v d ) + ^\\u - v x \\ 2 + ^\\w- v d \\ 2 . 

Each ADMM iteration alternately minimizes Lg first with respect to (u,w) and then with 
respect to (v x ,v d ) before updating the dual variables (p x ,p d )- The explicit iterations are 
described in the following algorithm. 



Algorithm 4: ADMM for solving convex subproblem for Problem 1 
Define 5 > 0, v x , v®, p x and p d arbitrarily and let k = 0. 

Define the weights j3 in the projection Ilx^ by /3j = {tj\f (2C% + 51) jj)^ 1 j = 1, M. 
while not converged 

u k+l =x n + ( A T A + 2( jx + 5l yl ^ y k _ x n^ _ p k _ X7 x F H (x n , <T)) 

(2C d n + Sl)-m x , {{2C d n + 5iyHSv k d ~v\- V d F H (x n , d n ) + 2C d J) 



w 



k+1 



J %3 
l V dj. 



fc+l / „fc+l , P *3 

P k x +1 =p k x + 5(u k+1 -v k x +1 ) 



j = l,...,M 



P k d +1 



p k d + 5{w k+1 
k + 1 



v 



end while 



We stop iterating and let x n+ = v x and d n+ = v d once the relative errors of the primal 
and dual variables are sufficiently small. The projections II and ILv e can be efficiently 
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computed by combining projections onto the non-negative orthant and projections onto the 
appropriate simplices. These can in principle be computed in linear time |45| . although we 
use a method that is simpler to implement and is still only 0{n log n) in the dimension of the 
vector being projected. 

4 Applications 

In this section we introduce four specific applications related to DOAS analysis and hyper- 
spectral demixing. We show how to model these problems in the form of (jlOp so that the 
algorithms from Section [3] can be applied. 

4.1 DOAS Analysis 

The goal of DOAS is to estimate the concentrations of gases in a mixture by measuring over 
a range of wavelengths the reduction in the intensity of light shined through it. A thorough 
summary of the procedure and analysis can be found in [16J. 

Beer's law can be used to estimate the attenuation of light intensity due to absorption. 
Assuming the average gas concentration c is not too large, Beer's law relates the transmitted 
intensity /(A) to the initial intensity Iq{X) by 

J(A) = / (A)exp- CT ( A ) cL , (21) 

where A is wavelength, cr(A) is the characteristic absorption spectra for the absorbing gas and 
L is the light path length. 

If the density of the absorbing gas is not constant, we should instead integrate over the 
light path, replacing exp _CT ( A ) cZ/ by exp -0 "^ c i l ) dl , Yoi simplicity, we will assume the con- 
centration is approximately constant. We will also denote the product of concentration and 
path length, cL, by a. 

When multiple absorbing gases are present, acr(A) can be replaced by a linear combination 
of the characteristic absorption spectra of the gases, and Beer's law can be written as 

/(A) = / (A)exp-^ a ^ (A) . 

Additionally taking into account the reduction of light intensity due to scattering, com- 
bined into a single term e(A), Beer's law becomes 

/(A) = / (A)exp-^ a ^ (A) - e(A) . 

The key idea behind DOAS is that it is not necessary to explicitly model effects such as 
scattering, as long as they vary smoothly enough with wavelength to be removed by high pass 
filtering that loosely speaking removes the broad structures and keeps the narrow structures. 
We will assume that e(A) is smooth. Additionally, we can assume that Io(X), if not known, 
is also smooth. The absorption spectra (Xj (A) can be considered to be a sum of a broad part 
(smooth) and a narrow part, <jj = <j|? road + f j? arrow . Since <7? arrow represents the only narrow 
structure in the entire model, the main idea is to isolate it by taking the log of the intensity and 
applying high pass filtering or any other procedure, such as polynomial fitting, that subtracts 
a smooth background from the data. The given reference spectra should already have had 
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their broad parts subtracted, but it may not have been done consistently, so we will combine 
abroad and e(A) into a single term B(X). We will also denote the given reference spectra by 
Uj, which again are already assumed to be approximately high pass filtered versions of the 
true absorption spectra Oj. With these notational changes, Beer's law becomes 

/(A) = J (A) exp- ^ «iVi(A)-B(A) _ (22) 

In practice, measurement errors must also be modeled. We therefore consider multiplying 
the right hand side of (|22p by s(A), representing wavelength dependent sensitivity. Assuming 
that s(A) ~ 1 and varies smoothly with A, we can absorb it into B(X). Measurements may 
also be corrupted by convolution with an instrument function h(X), but for simplicity we 
will assume this effect is negligible and not include convolution with h in the model. Let 
J(A) = — ln(I(A)). This is what we will consider to be the given data. By taking the log, the 
previous model simplifies to 

J(A) = -ln(/ (A)) + J> i% -(A) + B(X) + 17(A), 
j 

where 17(A) represents the log of multiplicative noise, which we will model as being approxi- 
mately white Gaussian noise. 

Since Iq{X) is assumed to be smooth, it can also be absorbed into the B(X) component, 
yielding the data model 

J(A) = J> i% -(A) + B{\) + 77(A). (23) 
3 

4.1.1 DOAS Analysis with Wavelength Misalignment 

A challenging complication in practice is wavelength misalignment, i.e., the nominal wave- 
lengths in the measurement J(A) may not correspond exactly to those in the basis yj(X). We 
must allow for small, often approximately linear deformations Vj(X) so that yj(X + Vj(X)) are 
all aligned with the data J(A). Taking into account wavelength misalignment, the data model 
becomes 

J(A) = a 3Vj( X + v jW) + B(X) + 77(A). (24) 

3 

To first focus on the alignment aspect of this problem, assume B(X) is negligible, having 
somehow been consistently removed from the data and references by high pass filtering or 
polynomial subtraction. Then given the data J(A) and reference spectra {yj(X)}, we want to 
estimate the fitting coefficients {a?} and the deformations {vj(X)} from the linear model, 

M 

J(A) = Y,a jVj {\ + Vj(X)) + 77(A) , (25) 
i=i 

where M is the total number of gases to be considered. 

Inspired by the idea of using a set of modified bases for image deconvolution [22], we 
construct a dictionary by deforming each yj with a set of possible deformations. Specifically, 
since the deformations can be well approximated by linear functions, i.e., Vj(X) = pjX + qj, 
we enumerate all the possible deformations by choosing Pj,qj from two pre-determined sets 
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{Pi, • • • , Pk}i {Qi, ■ ■ ■ j Ql}- Let Aj be a matrix whose columns are deformations of the jth 
reference J/j(A), i.e., yj(X + -PfcA + Qj) for k = 1, • • • , if and / = 1, • • • , L. Then we can rewrite 
the model (|25j) in terms of a matrix-vector form, 



J = [Air ■ ■ ,A 



M 



Xl 



XM 



(26) 



where xj G M^ 1, and J G R w . 

We propose the following minimization model, 



arg min^ | 



.X! 



(27) 



s.t. Xj ^ 0, llxj [ | o ^ 1 3 — 1) • • • j M . 



The second constraint in (1271) is to enforce each to have at most one non-zero element. 



Having ||xj||o = 1 indicates the existence of the gas with a spectrum yj Its non-zero index 
corresponds to the selected deformation and its magnitude corresponds to the concentration 
of the gas. This la constraint makes the problem NP-hard. A direct approach is the penalty 
decomposition method proposed in [28], which we will compare to in Section [5j Our approach 
is to replace the lo constraint on each group with intra sparsity penalties defined by Hi in ([5]) 
or Sj in ([9]), putting the problem in the form of Problem 1 or Problem 2. The intra sparsity 
parameters 7^ should be chosen large enough to enforce 1-sparsity within groups, and in the 
absence of any inter group sparsity assumptions we can set 70 = 0. 

4.1.2 DOAS with Background Model 

To incorporate the background term from (I24p . we will add B S M. w as an additional unknown 
and also add a quadratic penalty ^||Qi?|| 2 to penalize a lack of smoothness of B. This leads 
to the model 

min -|| Ac + B- J\\ 2 + ^||Q£|| 2 + R(x) , 
x&x,B 2 2 

where R includes our choice of intra sparsity penalties on x. This can be rewritten as 



mm — 

x£X,B 2 





A I 




X 




J 






VaQ_ 




B 










+ R(x) . 



(28) 



This has the general form of (1101) with the two by two block matrix interpreted as A and 



interpreted as b. Moreover, we can concatenate B and the M groups Xj by considering B 
to be group xm+i and setting 7M+1 = so that no sparsity penalty acts on the background 
component. In this way, we see that the algorithms presented in Section [3] can be directly 
applied to (j2"g|). 

It remains to define the matrix Q used in the penalty to enforce smoothness of the es- 
timated background. A possible strategy is to work with the discrete Fourier transform or 
discrete cosine transform of B and penalize high frequency coefficients. Although B should 
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Figure 4: Functions used to define background penalty 



be smooth, it is unlikely to satisfy Neumann or periodic boundary conditions, so based on an 
idea in [36], we will work with B minus the linear function that interpolates its endpoints. Let 
L G R w/xW/ be the matrix representation of the linear operator that takes the difference of B 
and its linear interpolant. Since LB satisfies zero boundary conditions and its odd periodic 
extension should be smooth, its discrete sine transform (DST) coefficients should rapidly de- 
cay. So we can penalize the high frequency DST coefficients of LB to encourage smoothness of 
B. Let r denote the DST and let Wb be a diagonal matrix of positive weights that are larger 
for higher frequencies. An effective choice is diag(M / e)i = i 2 , since the index i = 0, .., W — 1 is 
proportional to frequency. We then define Q = Wb^L in ([25]) and can adjust the strength of 
this smoothing penalty by changing the single parameter a > 0. Figure 2] shows the weights 
Wb and the result LB of subtracting from B the line interpolating its endpoints. 

4.2 Hyperspectral Image Analysis 

Hyperspectral images record high resolution spectral information at each pixel of an image. 
This large amount of spectral data makes it possible to identify materials based on their 
spectral signatures. A hyperspectral image can be represented as a matrix Y G M WxP , where 
P is the number of pixels and W is the number of spectral bands. 

Due to low spatial resolution or finely mixed materials, each pixel can contain multiple 
different materials. The spectral data measured at each pixel, according to a linear mixing 
model, is assumed to be a non-negative linear combination of spectral signatures of pure 
materials, which are called endmembers. The list of known endmembers can be represented 
as the columns of a matrix A G M l/l/xAr . 

The goal of hyperspectral demixing is to determine the abundances of different materials 
at each pixel. Given Y, and if A is also known, the goal is then to determine an abundance 
matrix S G R NxP with S itj > 0. Each row of S is interpretable as an image that shows the 
abundance of one particular material at every pixel. Mixtures are often assumed to involve 
only very few of the possible materials, so the columns of S are often additionally assumed 
to be sparse. 
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4.2.1 Sparse Hyperspectral Demixing 

A simple but effective approach for hyperspectral demixing is NNLS, which here is to solve 

minllY- AS"|| 2 F , 

S>0 

were F denotes the Frobenius norm. Many other tools have also been used to encourage 
additional sparsity of S, such as l\ minimization and variants of matching pursuit |19j [4"4"] [27] 
I18j . If no spatial correlations are assumed, the demixing problem can be solved at each pixel 
independently. We can also add one of the nonconvex inter sparsity penalties defined by Hq 
in ([5]) or Sq in ([9]). The resulting problem can be written in the form 

min -\\Ax p - b p \\ 2 + R{x p ) , (29) 
x p >o 2 

where x p is the pth column of S and b p is the pth column of Y. We can define R(x p ) to equal 
Hq(x p ) or Sq(x p ), putting ([29]) in the general form of (fTUj) . 

4.2.2 Structured Sparse Hyperspectral Demixing 

In hyperspectral demixing applications, the dictionary of endmembers is usually not known 
precisely. There are many methods for learning endmembers from a hyperspectral image 
such as N-FINDR [UJ, vertex component analysis (VCA) [38], NMF [ID] . Bayesian methods 
|49[ 150] and convex optimization [15]. However, here we are interested in the case where we 
have a large library of measured reference endmembers including multiple references for each 
expected material measured under different conditions. The resulting dictionary A is assumed 
to have the group structure \A\, ■ ■ ■ ,Am], where each group Aj contains different references 
for the same jth material. 

There are several reasons that we don't want to use the sparse demixing methods of Section 
14.2.11 when A contains a large library of references defined in this way. Such a matrix A with 
many nearly redundant references will likely have high coherence. This creates a challenge 
for existing methods. The grouped structure of A also means that we want to enforce a 
structured sparsity assumption on the columns of S. The linear combination of endmembers 
at any particular pixel is assumed to involve at most one endmember from each group Aj. 
Linearly combining multiple references within a group may not be physically meaningful, since 
they all represent the same material. Restricting our attention to a single pixel p, we can 



write the pth abundance column x p of S as 



The sparsity assumption requires each 



_XM, P J 

group of abundance coefficients Xj iP to be at most one sparse. We can enforce this by adding 
sufficiently large intra sparsity penalties to the objective in ([29]) defined by Hj(xj tP ) ([5]) or 
S*(x j>p ) ®. 

We think it may be important to use an expanded dictionary to allow different endmem- 
bers within groups to be selected at different pixels, thus incorporating endmember variability 
into the demixing process. Existing methods accomplish this in different ways, such as the 
piece- wise convex endmember detection method in [21] , which represents the spectral data 
as convex combinations of endmember distributions. It is observed in [21] that real hyper- 
spectral data can be better represented using several sets of endmembers. Additionally, their 



19 



better performance compared to VCA, which assumes pixel purity, on a dataset which should 
satisfy the pixel purity assumption, further justifies the benefit of incorporating endmember 
variability when demixing. 

If the same set of endmembers were valid at all pixels, we could attempt to enforce row 
sparsity of S using for example the ^i j00 penalty used in [15], which would encourage the 
data at all pixels to be representable as non-negative linear combinations of the same small 
subset of endmembers. Under some circumstances, this is a reasonable assumption and could 
be a good approach. However, due to varying conditions, a particular reference for some 
material may be good at some pixels but not at others. Although atmospheric conditions are 
of course unlikely to change from pixel to pixel, there could be nonlinear mixing effects that 
make the same material appear to have different spectral signatures in different locations pQ. 
For instance, a nonuniform layer of dust will change the appearance of materials in different 
places. If this mixing with dust is nonlinear, then the resulting hyperspectral data cannot 
necessarily be well represented by the linear mixture model with a dust endmember added 
to the dictionary. In this case, by considering an expanded dictionary containing reference 
measurements for the materials covered by different amounts of dust, we are attempting 
to take into account these nonlinear mixing effects without explicitly modeling them. At 
different pixels, different references for the same materials can now be used when trying to 
best represent the data. 

The overall model should contain both intra and inter sparsity penalties. In addition to the 
one sparsity assumption within groups, it is still assumed that many fewer than M materials 
are present at any particular pixel. The full model can again be written as (|29p except with 
the addition of intra sparsity penalties. The overall sparsity penalties can be written either 
as 

M 

R(x p ,d p ) = ^27jHj(x j:P ,d jtP ) +~/ H (x p ) 
i=i 

or 

M 

r ( x p) = ^2 r yj s j J ( x j,p) + io s o( x p) ■ 



5 Numerical Experiments 

In this section, we evaluate the effectiveness of our implementations of Problems 1 and 2 on 
the four applications discussed in Section HJ The simplest DOAS example with wavelength 
misalignment from Section 14.1.11 is used to see how well the intra sparsity assumption is 
satisfied compared to other methods. Two convex methods that we compare to are NNLS 
(JT]) and a non-negative constrained l\ basis pursuit model like the template matching via l\ 
minimization in |51| . The l\ minimization model we use here is 

min ||x||i such that \\Ax — b\\ < r . (30) 

x>0 

We use MATLAB's lsqnonneg function, which is parameter free, to solve the NNLS model. 
We use Bregman iteration [6] to solve the l\ minimization model. We also compare to direct 
Iq minimization via penalty decomposition (Algorithm [5]). 
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The penalty decomposition method [28] amounts to solving (I27p by a series of minimization 
problems with an increasing sequence {pu}- Let x = [xi,--- ,xm], y = [yi, • ■ ■ ,Ym] an d 
iterate 

(x k+1 ,y k+1 ) = argmini||Ac - b\\ 2 + ^-\\x - y\\ 2 

s.t. yj >0, \\ yj \\ ^ 1 (31) 
p k+1 = ap k (for a > 1) . 

The pseudo-code of this method is given in Algorithm [5j 



Algorithm 5: A penalty decomposition method for solving (127ft 

Define p > 0, cr > 1, e , ej and initialize y. 

while ||x - y||oo > e G 
i=1 5 

while max{||a; 1 - z*~ :L || o, \\y l - y i-1 ||oo} > £; 
a* = (A T ^l + pId)~ 1 {A T b + pyO 
= 

for j = 1,-,M 

find the index of maximal x^, i.e., /j = argmax; Xj(Z) 

Set yj{lj) = max(xj(lj), 0) 
end for 
i = i+1; 
end while 

x = x\y = y\p = ap 
end while 



Algorithm [5] may require a good initialization of y or a slowly increasing p. If the max- 
imum magnitude locations within each group are initially incorrect, it can get stuck at a 
local minimum. We consider both least square (LS) and NNLS initializations in numerical 
experiments. Algorithms [T] and [2] also benefit from a good initialization for the same reason. 
We use a constant initialization, for which the first iteration of those methods is already quite 
similar to NNLS. 

We also test the effectiveness of Problems 1 and 2 on the three other applications discussed 
in Section HJ For DOAS with the included background model, we compare again to Algorithm 
We use the sparse hyperspectral demixing example to demonstrate the sparsifying effect of 
the inter sparsity penalties acting without any intra sparsity penalties. We compare to the l\ 
regularized demixing model in |19j using the implementation in [44 . To illustrate the effect 
of the intra and inter sparsity penalties acting together, we also apply Problems 1 and 2 to a 
synthetic example of structured sparse hyperspectral demixing. We compare the recovery of 
the ground truth abundance with and without the intra sparsity penalties. 

5.1 DOAS with Wavelength Alignment 

We generate the dictionary by taking three given reference spectra yj{\) for the gases HONO, 
N02 and 03 and deforming each by a set of linear functions. The resulting dictionary contains 
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HONO 



N02 



03 




Vj(X + P k X + Qi) for P k = -1.01 + 0.01A; (k = 1, • ■ ■ ,21), Qi = -1.1 + 0.1/ (I = 1, ■ • ■ ,21) and 
j = 1,2,3. Each yj £ M 1 ^ with W = 1024. The represented wavelengths in nanometers are 
A = 340 + 0.04038w, w = 0, .., 1023. We use odd reflections to extrapolate shifted references 
at the boundary. The choice of boundary condition should only have a small effect if the 
wavelength displacements are small. However, if the displacements are large, it may be a 
good idea to modify the data fidelity term to select only the middle wavelengths to prevent 
boundary artifacts from influencing the results. 

There are a total of 441 linearly deformed references for each of the three groups. In 
Figure[5l we plot the reference spectra of HONO, N02 and 03 together with several deformed 
examples. 

In our experiments, we randomly select one element for each group with random magnitude 
plus additive zero mean Gaussian noise to synthesize the data term J(X) £ M. w for W = 1024. 
Mimicking the relative magnitudes of a real DOAS dataset [52] after normalization of the 
dictionary, the random magnitudes are chosen to be at different orders with mean values of 
1, 0.1, 1.5 for HONO, N02 and 03 respectively. We perform three experiments for which the 
standard deviations of the noise are 0, .005 and .05 respectively. This synthetic data is shown 
in Figure [U 

The parameters used in the numerical experiments are as follows. NNLS is parameter free. 
For the l\ minimization method in (130p . ^= = .001, .005 and .05 for the experiments with 
noise standard deviations of 0, .005 and .05 respectively. For the direct Iq method (Algorithm 
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[5]), the penalty parameter p is initially equal to .05 and increases by a factor of a = 1.2 
every iteration. The inner and outer tolerances are set at 10 -4 and 10 -5 respectively. The 
initialization is chosen to be either a least squares solution or the result of NNLS. For Problems 
1 and 2 we define €j = .05 for all three groups. In general this could be chosen roughly on 
the order of the smallest nonzero coefficient expected in the jth. group. Recall that these ej 
are used both in the definitions of the regularized l\ - li penalties 5j in Problem 2 and in 
the definitions of the dummy variable constraints in Problem 1. We set jj = .1 and jj = .05 
for Problems 1 and 2 respectively and for j = 1, 2, 3. Since there is no inter sparsity penalty, 
7o = 0. For both Algorithms [T] and [2] we set C = 10 _9 I. For Algorithm [H which dynamically 
updates C, we set several additional parameters a = .1, £i = 2 and £2 = 10. These choices are 
not crucial and have more to do with the rate of convergence than the quality of the result. 
For both algorithms, the outer iterations are stopped when the difference in energy is less 
than 10 -8 , and the inner ADMM iterations are stopped when the relative errors of the primal 
and dual variables are both less than 10~ 4 . 

We plot results of the different methods in blue along with the ground truth solution in 
red. The experiments are shown in Figures [8] and E3 



5.2 DOAS with Wavelength Alignment and Background Estimation 

We solve the model (|28|) using I1/I2 and regularized l\ - I2 intra sparsity penalties. These 
are special cases of Problems 1 and 2 respectively. Depending on which, the convex set X 
is either the non-negative orthant or a subset of it. We compare the performance to the 
direct method (Algorithm [5]) and least squares. The dictionary consists of the same set of 
linearly deformed reference spectra for HONO, N02 and 03 as in Section f5.il The data J is 
synthetically generated by 

J(A) = .0121 yi (A) + .0011y 2 (A) + .0159y 3 (A) + _ 2 g34)4 + 77(A), 

where the references yj are drawn from columns 180, 682 and 1103 of the dictionary and 
the last two terms represent a smooth background component and zero mean Gaussian noise 
having standard deviation 5.5810" 5 . The parameter a in (|28p is set at 10 -5 for all the 
experiments. 

The least squares method for (|28j) directly solves 
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where A3 has only three columns randomly chosen from the expanded dictionary A, with one 
chosen from each group. Results are averaged over 1000 random selections. 

In Algorithm [5l the penalty parameter p starts at 10 -6 and increases by a factor of a = 1.1 
every iteration. The inner and outer tolerances are set at 10 -4 and 10 -6 respectively. The 
coefficients are initialized to zero. 

In Algorithms [U and [21 we treat the background as a fourth group of coefficients, after the 
three for each set of reference spectra. For all groups tj is set to .001. We set 7^ = .001 for 
j = 1,2,3, and 74 = 0, so no sparsity penalty is acting on the background component. We 
set C = 10 I for Algorithm [2] and C = 10 4 I for Algorithm [lj where again we use a = .1, 
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Figure 7: Method comparisons on synthetic DOAS data without noise. Computed coefficients 
(blue) are plotted on top of the ground truth (red). 
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Figure 8: Method comparisons on synthetic DOAS data: a = .005. Computed coefficients 
(blue) are plotted on top of the ground truth (red). 
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Figure 9: Method comparisons on synthetic DOAS data: a = .05. Computed coefficients 
(blue) are plotted on top of the ground truth (red). 
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Figure 10: Comparisons of how well the results of least squares, direct Iq, h/h and regularized 
l\ - 1% fit the data. 



£1 = 2 and £2 = 10. We use a constant but nonzero initialization for the coefficients x. The 
inner and outer iteration tolerances are the same as in Section 15.11 with the inner decreased 
to 10- 5 . 

Figure [10] compares how closely the results of the four methods fit the data. Plotted are 
the synthetic data, the estimated background, each of the selected three linearly deformed 
reference spectra multiplied by their estimated fitting coefficients and finally the sum of the 
references and background. 

The computed coefficient magnitudes and displacements are compared to the ground truth 
in Tabled) 

The dictionary perhaps included some unrealistically large deformations of the references. 
Nonetheless, the least squares result shows that the coefficient magnitudes are underestimated 
when the alignment is incorrect. The methods for the Iq, I1/I2 and regularized l\ - I2 models 
all produced good and nearly equivalent results. All estimated the correct displacements of 
HONO and 03, but not N02. The estimated amounts of HONO and N02 were correct. The 
amount of 03 was overestimated by all methods. This is because there was a large background 
component in the 03 reference. Even with background estimation included in the model, it 
should still improve accuracy to work with references that have been high pass filtered ahead 
of time. 

Although the methods for the Iq, I1/I2 and regularized l\ - I2 models all yielded similar 
solutions, they have different pros and cons regarding parameter selection and runtime. It is 
important that p not increase too quickly in the direct Iq method. Otherwise it can get stuck at 
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ground truth 


least squares 


^0 


k/h 


h - h 


ai (HONO coefficient) 


0.01206 


0.00566 


0.01197 


0.01203 


0.01202 


a2 (N02 coefficient) 


0.00112 


0.00020 


0.00081 


0.00173 


0.00173 


as (03 coefficient) 


0.01589 


0.00812 


0.01884 


0.01967 


0.01947 


vi (HONO displacement) 


0.01A - 0.2 


N/A 


0.01A - 0.2 


0.01A - 0.2 


0.01A - 0.2 


V2 (N02 displacement) 


-0.01A + 0.1 


N/A 


-0.09A - 0.9 


OA - 0.2 


OA - 0.2 


V3 (03 displacement) 


OA + 


N/A 


OA + 


OA + 


OA + 



Table 1: Comparison of estimated fitting coefficients and displacements for DOAS with back- 
ground estimation 
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Figure 11: Color visualization of urban hyperspectral image and hand selected endmembers 



a poor solution. For this DOAS example, the resulting method required about 200 iterations 
and a little over 10 minutes to converge. Algorithm [JJ for the I1/I2 model can sometimes waste 
effort finding splitting coefficients that yield a sufficient decrease in energy. Here it required 
20 outer iterations and ran in a few minutes. Algorithm [2] required 8 outer iterations and 
took about a minute. Choosing too large can also cause the I1/I2 and l\ - I2 methods to 
get stuck at bad local minima. On the other hand, choosing 7,- too small may result in the 
group 1-sparsity condition not being satisfied, whereas it is satisfied by construction in the 
direct Zn approach. Empirically, gradually increasing jj works well, but we have simply used 
fixed parameters for all our experiments. 

5.3 Hyperspectral Demixing with Inter Sparsity Penalty 

We use the urban hyperspectral dataset from [53]. Each column of the data matrix Y € 
jg> 187x94249 re p resen t s the spectral signature measured at a pixel in the 307 by 307 urban 
image shown in Figure [TTJ 

The data was processed to remove some wavelengths for which the data was corrupted, 
resulting in a spectral resolution reduced from 210 to 187. The six endmembers forming the 
columns of the dictionary A were selected by hand from pixels that appeared to be pure 
materials. These are also shown in Figure [TTJ The columns of both A and Y were normalized 
to have unit I2 norm. 

Algorithms [TJ and [2] were used to solve (|29p with I1/I2 and regularized l\ - I2 inter sparsity 
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NNLS 


h 


h/h 


h - h 


Fraction nonzero 
Sum of squares error 


0.4752 
1111.2 


0.2683 
19107 


0.2645 
1395.3 


0.2677 
1335.6 



Table 2: Fraction of nonzero abundances and sum of squares error for four demixing models 
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Figure 12: Estimated fraction planes for urban data using hand selected endmembers 



penalties respectively. These were compared to NNLS and h minimization [44], which solve 

min -\\Ax p - b p \\ 2 + j\\x p \\i (32) 

x p >0 I 

for each pixel p. The parameters were chosen so that the 1%, h/h and h - h approaches all 
achieved roughly the same level of sparsity, measured as the fraction of nonzero abundances. 
The sparsity and sum of squares errors achieved by the four models are tabulated in Table (2) 
The l\ penalty promotes sparse solutions by trying to move coefficient vectors perpendicular 
to the positive face of the h ball, shrinking the magnitudes of all elements. The h/h penalty, 
and to some extent h - h, promote sparsity by trying to move in a different direction, tangent 
to the h ball. They do a better job of preserving the magnitudes of the abundances while 
enforcing a similarly sparse solution. This is reflected in their lower sum of squares errors. 

The results of these demixing algorithms are also represented in Figure [12] as fraction 
planes, which are the rows of the abundance matrix visualized as images. They show the 
spatial abundance of each endmember. 
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Figure 13: Candidate endmembers (blue) for romaine lettuce at 4,5,6 and 7 weeks from Sali- 
nas dataset, normalized averages (red) and candidate endmembers farthest from the average 
(green) 



5.4 Hyperspectral Demixing with Intra and Inter Sparsity Penalties 

In this Section we consider a hyperspectral demixing example with an expanded dictionary 
consisting of groups of references, each group consisting of candidate endmembers for a par- 
ticular material. The data we use for this examples is from [54J and consists of a 204 band 
hyperspectral image of crops, soils and vineyards in Salinas Valley, California. Using a given 
ground truth labeling, we extract just the data corresponding to romaine lettuce at 4, 5, 6 and 
7 weeks respectively. For each of these four groups, we remove outliers and then randomly 
extract 100 representative signatures. These and their normalized averages are plotted in 
Figure [13] and give a sense of the variability of the signatures corresponding to a particular 
label. 



By concatenating the four groups of 100 signatures we construct a dictionary A. 



x400 



group 



. We also construct two smaller dictionaries j4 m ean 

and A had S M 2U4X4 . The columns 
of 

j4 m ean are the average spectral signatures shown in red in Figure 1131 and the columns of 
^4bad are the candidate signatures farthest from the average shown in green in Figure [131 

Synthetic data b E ]j204xi560 wag constructed by randomly constructing a ground truth 
abundance matrix S group G R 400xl56 ° with 1000 1 -sparse columns, 500 2-sparse columns, 50 
3-sparse columns and 10 4-sparse columns, with each group of 100 coefficients being at most 
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1-sparse. Zero mean Gaussian noise rj with standard deviation .005 was also added so that 



b — ^group "Sgroup ~i~ V • 

Each k-sparse abundance column was constructed by first randomly choosing k groups, then 
randomly choosing one element within each of the selected groups and assigning a random 
magnitude in [0,1]. The generated columns were then rescaled so that the columns of the 
noise free data matrix would have unit I2 norm. 

Define T G jj 4x400 to be a block diagonal matrix with 1 by 100 row vectors of ones as the 
blocks. 

1---1 

1---1 

1---1 

1 . . . 1_ 

Applying T to .Sgroup lets us construct a ground truth group abundance matrix S £ jj 4x 1.560 
by summing the adundances within groups. For comparison purposes, this will allow us to 
apply different demixing methods using the different sized dictionaries ^4. me an; ^-group and -*4bad 
to compute S meSLn , T5g roup and S^ad respectively, which can all then be compared to S. 
We compare six different demixing methods using the three dictionaries: 

1. NNLS CQ using A mean , -4 g rou P and ,4 ba d 

2. l\ (|32]) using 

^4-meanj ^.group and -Abad 

3. I1/I2 (Problem 1) inter sparsity only, Using ^mean and ^Ibad 

4. l\ - I2 (Problem 2) inter sparsity only, using ^4 mea n and Ab a d 

5. I1/I2 intra and inter sparsity, using A group 

6. l\ - I2 intra and inter sparsity, using ^4 gr0 up 

For l\ demixing, we set 7 = .1 for A mean and ^4bad and 7 = .001 for A gTonp . In all applications 
of Algorithms [T] and [21 we use a constant but nonzero initialization and set €j = .01, 70 = .01 
and C = 10~ 9 I. For the applications with intra sparsity penalties, 7,- = .0001 for j = 1, 2, 3, 4. 
Otherwise 7^ = 0. For Algorithm CO we again use a = .1, £1 = 2 and £2 = 10. We stop 
iterating when the difference in the objective is less than .001. 

We compare the computed group abundances to the ground truth S in two ways in Table 
[3l Measuring the Iq norm of the difference of abundance matrices indicates how accurately 
the sparsity pattern was estimated. For each material, we also compute the absolute value of 
each group abundance error averaged over all measurements. For visualization, we plot the 
computed number of nonzero entries versus the ground truth for each column of the group 
abundances in Figure [HI 

We see in Table [3] and Figure [H] that NNLS did a poor job at finding sparse solutions 
although average coefficient errors were low. On the other hand, l\ minimization did a good 
job of finding a sparse solution, but coefficient errors were higher because the abundance 
magnitudes were underestimated. The and l\ — I2 minimization approaches were better 
at encouraging sparse solutions while maintaining small average errors in the abundance 
coefficients. 
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Figure 14: Estimated number of nonzero entries in each abundance column (blue) and ground 



truth (red). Row 1: S D 



Row 2: TS, 



group • 



Row 3: Shad- 
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NNLS 


h 


h/h 


h - h 


tJ T~n a n kD 


1537 


935 


786 


784 


pmean 
1 


0.0745 


0.1355 


0.0599 


0.0591 


p mean 
2 


0.0981 


0.1418 


0.0729 


0.0722 


^mcan 


0.0945 


0.1627 


0.0865 


0.0868 


^mcan 
4 


0.0542 


0.1293 


0.0514 


0.0492 


II giuup 1 1 \J 


1814 


851 


889 


851 


^igroup 


0.0624 


0.1280 


0.0717 


0.0691 


4 roup 


0.0926 


0.1300 


0.0877 


0.0782 


^.group 


0.1066 


0.1625 


0.1147 


0.1049 


4 roup 


0.0618 


0.1249 


0.0681 


0.0621 


Sbad — 5 


2123 


1093 


1134 


1076 


£bad 


0.0804 


0.1391 


0.0666 


0.0633 


J^bad 


0.1410 


0.1353 


0.0900 


0.0768 


^bad 


0.1400 


0.1733 


0.1000 


0.1046 


^■bad 


0.0759 


0.1540 


0.0646 


0.0713 



Table 3: Errors between computed group abundance and ground truth S, where 
E— = 1 Ep=l |Smean0» - S(j,p)\, E?«* = £ £j =1 \(TS gIoup )(j,p) - S(j,p)\, = 

t Ep=i l-s'badO'.p) - S(j,p)\, 



For this example, the average signatures used in ^4 mcan turned out to be good choices 
for the endmembers, and we didn't see any improvement in the estimated group abundances 
by considering the expanded dictionary ^4g roU p- However, compared to using the four poorly 
selected endmember candidates in ^4bad> we got better results with the expanded dictionary. 
In the expanded dictionary case, which resulted in an underdetermined dictionary matrix, the 
abundances Sgroup directly computed by l\ minimization were much less sparse than those 
computed by ^ and Zi — I2 minimization. This is because and l\ — I2 minimization were 
able to enforce 1-sparsity within coefficient groups, but l\ was not. If the group 1-sparsity 
requirement is important for the model to be accurate, then this is an advantage of using the 
^- and l\ — 12 penalties. Here, this difference in sparsity turned out not to have much effect on 
the group abundances TS group , which were computed by summing the abundances within each 
group. This may not hold in situations where the endmember variability is more nonlinear. 
For example, if the endmember variability had to do with misalignment, as with the earlier 
DOAS example, then linear combinations of misaligned signatures would not produce a good 
reference signature. 

6 Conclusions and Future Work 

We proposed a method for linear demixing problems where the dictionary contains multiple 
references for each material and we want to collaboratively choose the best one for each 
material present. More generally, we showed how to use and l\ — I2 penalties to obtain 
structured sparse solutions to non-negative least squares problems. These were reformulated 
as constrained minimization problems with differentiable but non-convex objectives. A scaled 
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gradient projection method based on difference of convex programming was proposed. This 
approach requires solving a sequence of strongly quadratic programs, and we showed how these 
can be efficiently solved using the alternating direction method of multipliers. Moreover, few 
iterations were required in practice, between 4 and 20 for all the numerical examples presented 
in this paper. Some convergence analysis was also presented to show that limit points of the 
iterates are stationary points. Numerical results for demixing problems in differential optical 
absorption spectroscopy and hyperspectral image analysis show that our difference of convex 
approach using and l\ — 1% penalties is capable of promoting different levels of sparsity on 
possibly overlapping subsets of the fitting or abundance coefficients. 

For future work we would like to test this method on more general multiple choice quadratic 
knapsack problems, which are related to the applications presented here that focused on find- 
ing solutions that were at most 1-sparse within specified groups. It would be interesting to 
see how this variational approach performs relative to combinatorial optimization strategies 
for similar problems. We are also interested in exploring alternative sparsity penalties that 
can be adapted to the data set. When promoting 1-sparse solutions, the experiments in this 
paper used fixed sparsity parameters that were simply chosen to be sufficiently large. We are 
interested in justifying the technique of gradually increasing this parameter while iterating, 
which empirically seems better able to avoid bad local minima. The applications presented 
here all involved uncertainty in the dictionary, which was expanded to include multiple can- 
didate references for each material. If a-priori assumptions are available about the relative 
likelihood of these candidates, we would like to incorporate this into the model. 

Acknowledgments - We thank Lisa Wingen for providing DOAS references and data, which 
we used as a guide when generating synthetic data for some of our numerical examples. We 
also thank John Greer for pointing out a paper by A. Zare and P. Gader [2TJ. 
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